The Classification and Mixture Maximum Likelihood
Approaches to Cluster Analysis*

G.J. McLachlan 

1. INTRODUCTION 

A common and very old problem in statistics is the separation of a
heterogeneous population into more homogeneous subpopulations. We
concentrate here on the situation where the population of interest, $\Pi$,
is known or assumed to consist of, say, $k$ different subpopulations
$\Pi_1,\ldots,\Pi_k$, and where the density of a $p$-dimensional observation
$x$ from $\Pi_i$ is known or assumed to be $f_i(x;\theta)$ for some unknown
vector of parameters, $\theta$ $(i=1,\ldots,k)$. In this context the problem may
be formulated as follows: Given a random sample of observations
$x_1,\ldots,x_n$ from $\Pi$, attempt to allocate each $x_j$ to the
subpopulation to which it belongs. We let $\gamma' = (\gamma_1,\ldots,\gamma_n)$
denote the set of identifying labels, where $\gamma_j = i$ if $x_j$ comes
from $\Pi_i$. This would be the classical discrimination problem if
$\gamma$ were known a priori; a discrimination procedure would be formed
from the classified sample for the allocation of subsequent
observations of unknown origin.

In what is sometimes called the classification maximum likelihood
procedure, $\theta$ and $\gamma$ are chosen to maximize

$$L_C(x_1,\ldots,x_n;\theta,\gamma) = \prod_{j=1}^{n} f_{\gamma_j}(x_j;\theta). \tag{1.1}$$

The maximization is over the set of values of $\gamma$ corresponding to
all possible assignments of the $x_j$ to the various subpopulations
as well as over all admissible values of $\theta$. The estimates of $\theta$
and $\gamma$ so obtained are denoted by $\hat\theta$ and $\hat\gamma$ respectively. The
$x_1,\ldots,x_n$ are then classified according to the estimates
$\hat\gamma_1,\ldots,\hat\gamma_n$; for example, $x_j$ is assigned to $\Pi_g$ if
$\hat\gamma_j = g$. This procedure has been considered by several authors
including Hartley and Rao [14], John [17], Scott and Symons [31], and
Sclove [30]. Unfortunately, with this procedure, the $\gamma_j$ increase in
number with the number of observations, and under such conditions the
maximum likelihood estimates need not be consistent. Marriott [23]
pointed out that under the standard assumption of normal distributions
with common variance matrices, this procedure gives definitely
inconsistent estimates for the parameters involved. More recently,
Bryant and Williamson [4] extended Marriott's results and showed that
the method may be expected to give asymptotically biased results quite
generally.

*To appear in Vol. II of the Handbook of Statistics (edited by P.R.
Krishnaiah and L. Kanal).

A related approach is the mixture maximum likelihood method
considered by Day [5] and Wolfe [34], among many others. With this
approach $x_1,\ldots,x_n$ are assumed to be a random sample of size $n$
from a mixture of $\Pi_1,\ldots,\Pi_k$ in the proportions
$\epsilon' = (\epsilon_1,\ldots,\epsilon_k)$. Hence the likelihood

$$L_M(x_1,\ldots,x_n;\theta,\epsilon) = \prod_{j=1}^{n} \Bigl\{ \sum_{i=1}^{k} \epsilon_i f_i(x_j;\theta) \Bigr\} \tag{1.2}$$

can be formed; the estimates of $\theta$ and $\epsilon$ obtained by maximizing
(1.2) are denoted by $\hat\theta$ and $\hat\epsilon$ respectively. Each $x_j$ can
then be classified on the basis of the estimated posterior probabilities
$\hat P_{ij}$ $(i=1,\ldots,k)$ formed by replacing $\theta$ and $\epsilon$ with
$\hat\theta$ and $\hat\epsilon$ in

$$P_{ij} = \epsilon_i f_i(x_j;\theta) \Big/ \sum_{t=1}^{k} \epsilon_t f_t(x_j;\theta).$$

It can be seen that the mixture approach is equivalent to the
classification procedure with the additional assumption that
$\gamma_1,\ldots,\gamma_n$ is an (unobservable) random sample from a probability
distribution with mass $\epsilon_i$ at $i$ $(i=1,\ldots,k)$. It appears to avoid
the asymptotic biases associated with the classification procedure,
where at each step in the iterative process of computing the maximum
likelihood estimates each $x_j$ is assigned outright to a particular
subpopulation according to the estimate for $\gamma_j$. By contrast, the
mixture approach does not insist on definite membership of any
subpopulation; rather it gives an estimated probability of membership
of each subpopulation.

Note that another approach to this problem is to proceed further and 
adopt a Bayesian procedure in which all parameters are random variables 
(Binder [2], Symons [32]). 

A common assumption in practice is to adopt the normality model

$$x_j \sim N(\mu_i, \Sigma) \quad \text{in } \Pi_i \quad (i=1,\ldots,k). \tag{1.3}$$

In this case $\theta$ has $\tfrac{1}{2}p(p+2k+1)$ elements, comprising the
components of the $k$ mean vectors $\mu_i$ and the distinct elements of
the common covariance matrix $\Sigma$, and the density $f_i(x;\theta)$ is
given by

$$f(x;\mu_i,\Sigma) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2} \exp\{-\tfrac{1}{2}(x-\mu_i)'\Sigma^{-1}(x-\mu_i)\}.$$

We now proceed to consider the application of the classification and
mixture approaches under the normality model (1.3), which is assumed
to hold through to Section 5, where the condition of a common
covariance matrix is relaxed to cover the general case of unequal
covariance matrices.
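The difference between the two approaches can be illustrated with a minimal sketch for the univariate case ($p = 1$) of model (1.3). It evaluates the posterior probabilities used by the mixture approach next to the outright assignment used by the classification approach. The function names and all numerical values are hypothetical choices made for illustration only.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density, i.e. model (1.3) with p = 1."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior_probs(x, means, var, props):
    """Mixture approach: posterior probability of each subpopulation for x."""
    weighted = [e * normal_pdf(x, m, var) for e, m in zip(props, means)]
    total = sum(weighted)
    return [w / total for w in weighted]


def hard_label(x, means, var):
    """Classification approach: assign x outright to the subpopulation
    with the largest density (the proportions play no part)."""
    dens = [normal_pdf(x, m, var) for m in means]
    return dens.index(max(dens))


# Hypothetical values: two subpopulations in very unequal proportions.
probs = posterior_probs(0.3, [-1.0, 1.0], 1.0, [0.9, 0.1])
label = hard_label(0.3, [-1.0, 1.0], 1.0)
```

With these illustrative values the observation lies nearer the second mean, so the outright assignment picks the second subpopulation, whereas the posterior probabilities favour the first because of its much larger proportion.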


2. CLASSIFICATION APPROACH 


In principle the maximization process for the classification maximum 
likelihood procedure can be carried out since it is just a matter of 
computing the maximum value of the likelihood (1.1) over all possible 
partitions of the n observations to the k subpopulations. However, 
unless n is quite small, searching over all possible partitions is 
prohibitive. It follows that $\hat\gamma_j = g$ if

$$f(x_j;\hat\mu_g,\hat\Sigma) \ge f(x_j;\hat\mu_i,\hat\Sigma) \quad (i=1,\ldots,k), \tag{2.1}$$

where $\hat\mu_i$ and $\hat\Sigma$ are the ordinary maximum likelihood estimates of
$\mu_i$ and $\Sigma$ for a sample of normal observations classified according
to $\hat\gamma$. Hence the solution can be computed iteratively (John [17],
Sclove [30]). Starting with some initial clustering $\gamma$, the $\mu_i$
and $\Sigma$ are estimated accordingly and then used to give a new estimate
of $\gamma$ on the basis of (2.1), which is equivalent to allocating each
observation to the nearest cluster centre in terms of the estimated
Mahalanobis distance. Each step in the iterative process yields a value
of the likelihood not less than that at the previous step, and the
iterations may be continued until no observation changes clusters.
Various starting values should be taken in an attempt to locate the
global solution. It will be seen in the next section that the likelihood
equations under the mixture approach can be easily modified to be
applicable also under the classification approach. There are other
procedures for finding the solution under the classification approach;
for example, the Mahalanobis distance version of MacQueen's [20] k-means
procedure, where the $\mu_i$ and $\Sigma$ are re-estimated after each
observation is allocated rather than waiting until after all the
observations have been allocated.
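The batch iteration described above can be sketched as follows for the univariate case. This is a minimal illustration under a simplifying assumption: with $p = 1$ and a common variance, the Mahalanobis distance is the squared difference from the mean divided by that variance, so allocation by (2.1) reduces to allocation to the nearest cluster mean and the variance estimate is not needed for the allocation step. The function names and data are hypothetical.

```python
def cluster_means(xs, labels, k):
    """Maximum likelihood estimates of the k means for a given labelling."""
    means = []
    for i in range(k):
        members = [x for x, g in zip(xs, labels) if g == i]
        if not members:
            raise ValueError("cluster %d is empty" % i)
        means.append(sum(members) / len(members))
    return means


def classification_ml_1d(xs, labels, k, max_iter=100):
    """Alternate between estimating the means and reallocating by (2.1)
    until no observation changes cluster (or max_iter is reached)."""
    labels = list(labels)
    for _ in range(max_iter):
        means = cluster_means(xs, labels, k)
        # Common variance: the nearest mean is the nearest centre in
        # Mahalanobis distance.
        new_labels = [min(range(k), key=lambda i: (x - means[i]) ** 2) for x in xs]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, cluster_means(xs, labels, k)


# Hypothetical data and a deliberately poor initial clustering.
data = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
final_labels, final_means = classification_ml_1d(data, [0, 1, 0, 1, 0, 1], 2)
```

Because the result depends on the initial clustering, such a routine would in practice be run from several starting values, as recommended in the text.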

For the classification approach applied under the normality model
(1.3), Scott and Symons [31] showed that $\hat\gamma$ corresponds to the
partition which minimizes the determinant of the pooled
within-subpopulations sum of squares matrix

$$W = \sum_{i=1}^{k} W_i,$$

where

$$W_i = \sum_{q=1}^{n_i} (x_{iq}-\bar{x}_i)(x_{iq}-\bar{x}_i)',$$

and $x_{iq}$ $(q=1,\ldots,n_i)$ denote the $n_i$ observations assigned to
$\Pi_i$ according to $\hat\gamma$ and $\bar{x}_i$ refers to their sample mean;
see also Friedman and Rubin [9], who originally suggested this criterion.
The minimization of $|W|$ would appear to be a reasonable clustering
criterion regardless of the underlying distributions. Marriott
[22] has given a comprehensive account of the properties of this
criterion. It does have the tendency to produce clusters of roughly
equal size, although the modified version,

$$n \log|W| - 2 \sum_{i=1}^{k} n_i \log n_i,$$

suggested recently by Symons [32], would appear to go some way to
overcoming this.
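As an illustration of the two criteria, the sketch below computes the pooled within-subpopulations matrix $W$, its determinant, and the modified criterion for bivariate data ($p = 2$, so that the determinant can be written out by hand). The function names and the data are hypothetical, and the code assumes $|W| > 0$.

```python
import math


def pooled_within_ss(points, labels, k):
    """Return the 2x2 matrix W (as nested lists) and the cluster sizes n_i."""
    w11 = w12 = w22 = 0.0
    sizes = []
    for i in range(k):
        members = [p for p, g in zip(points, labels) if g == i]
        n_i = len(members)
        sizes.append(n_i)
        if n_i == 0:
            continue
        m1 = sum(p[0] for p in members) / n_i
        m2 = sum(p[1] for p in members) / n_i
        for a, b in members:
            w11 += (a - m1) ** 2
            w12 += (a - m1) * (b - m2)
            w22 += (b - m2) ** 2
    return [[w11, w12], [w12, w22]], sizes


def det2(w):
    """Determinant of a 2x2 matrix."""
    return w[0][0] * w[1][1] - w[0][1] * w[1][0]


def symons_criterion(points, labels, k):
    """n log|W| - 2 sum_i n_i log n_i (smaller is better)."""
    w, sizes = pooled_within_ss(points, labels, k)
    n = len(points)
    return n * math.log(det2(w)) - 2.0 * sum(s * math.log(s) for s in sizes if s > 0)


# Hypothetical data: two well separated square clusters.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
       (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]
good = [0, 0, 0, 0, 1, 1, 1, 1]   # the natural partition
bad = [0, 1, 0, 1, 0, 1, 0, 1]    # a partition that mixes the two squares
```

For these data the natural partition gives a much smaller $|W|$ than the mixed one; since both partitions have equal cluster sizes, the modified criterion ranks them in the same order.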


3. MIXTURE APPROACH 

An excellent account of the computation of the maximum likelihood
estimates of $\mu_i$, $\Sigma$, and $\epsilon$ for the mixture approach has been
given by Day [5]. Under the normality model (1.3), the posterior
probabilities $P_{ij}$ $(i=1,\ldots,k;\ j=1,\ldots,n)$ have the form

$$P_{ij} = \exp(a_i'x_j + b_i) \Big/ \sum_{r=1}^{k} \exp(a_r'x_j + b_r),$$

where

$$a_r = \Sigma^{-1}(\mu_r - \mu_1)$$

and

$$b_r = -\tfrac{1}{2}\,a_r'(\mu_r + \mu_1) + \log(\epsilon_r/\epsilon_1)$$

for $r = 1,\ldots,k$; that is, $a_1 = 0$ and $b_1 = 0$. The maximum
likelihood estimates are evaluated from the equations

$$\hat\epsilon_i = \sum_{j=1}^{n} \hat P_{ij}/n, \tag{3.1}$$

$$\hat\mu_i = \sum_{j=1}^{n} \hat P_{ij}\,x_j \big/ (n\hat\epsilon_i), \tag{3.2}$$

and

$$\hat\Sigma = \sum_{i=1}^{k}\sum_{j=1}^{n} (\hat P_{ij}/n)(x_j-\hat\mu_i)(x_j-\hat\mu_i)', \tag{3.3}$$

which can be solved iteratively by substituting some initial values
for the estimates into the right-hand side of (3.1) to (3.3) to
produce new estimates on the left-hand side, which are then
substituted into the right-hand side, and so on. These iterative
estimates can be identified with those obtained by directly applying
the so-called EM algorithm of Dempster et al. [6], which shows that the
estimates will converge to a local maximum irrespective of the
starting point. The iterative process should be started from several
points in an attempt to ensure that the global maximum is obtained.
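A minimal sketch of one pass through equations (3.1) to (3.3) is given below for the univariate case ($p = 1$, common variance). The names `em_step` and `log_likelihood`, the data, and the starting values are hypothetical. The sketch makes no attempt to guard against numerical underflow when an observation is very far from every mean.

```python
import math


def log_likelihood(xs, means, var, props):
    """Log of the mixture likelihood (1.2) for univariate normal components."""
    const = 1.0 / math.sqrt(2.0 * math.pi * var)
    total = 0.0
    for x in xs:
        dens = sum(e * const * math.exp(-0.5 * (x - m) ** 2 / var)
                   for e, m in zip(props, means))
        total += math.log(dens)
    return total


def em_step(xs, means, var, props):
    """Substitute current estimates into the right-hand sides of (3.1)-(3.3)."""
    n, k = len(xs), len(means)
    # Posterior probabilities P_ij; the common normalising constant cancels.
    post = []
    for x in xs:
        w = [props[i] * math.exp(-0.5 * (x - means[i]) ** 2 / var) for i in range(k)]
        s = sum(w)
        post.append([wi / s for wi in w])
    new_props = [sum(row[i] for row in post) / n for i in range(k)]            # (3.1)
    new_means = [sum(row[i] * x for row, x in zip(post, xs)) / (n * new_props[i])
                 for i in range(k)]                                            # (3.2)
    new_var = sum(row[i] * (x - new_means[i]) ** 2
                  for row, x in zip(post, xs) for i in range(k)) / n           # (3.3)
    return new_means, new_var, new_props


# Hypothetical data and starting values.
sample = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.0, 2.1, 2.2, 1.8]
est_means, est_var, est_props = [-1.0, 1.0], 1.0, [0.5, 0.5]
history = [log_likelihood(sample, est_means, est_var, est_props)]
for _ in range(30):
    est_means, est_var, est_props = em_step(sample, est_means, est_var, est_props)
    history.append(log_likelihood(sample, est_means, est_var, est_props))
```

The list `history` records the log likelihood after each pass, which gives a simple check that it never decreases from one iteration to the next.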
Day [5] has shown that considerable computing time can be saved
for $k = 2$ by reparametrizing the likelihood in terms of $a$, $b$, $m$,
and $V$, where

$$m = \epsilon_1\mu_1 + \epsilon_2\mu_2$$

and

$$V = \Sigma + \epsilon_1\epsilon_2(\mu_1-\mu_2)(\mu_1-\mu_2)'$$

are the mean and covariance matrix of the mixture distribution; $a$
and $b$ denote $a_2$ and $b_2$ with their subscripts suppressed since
$k = 2$ only. The maximum likelihood equations now can be written as

$$\hat m = \sum_{j=1}^{n} x_j/n, \tag{3.4}$$

$$\hat V = \sum_{j=1}^{n} (x_j-\hat m)(x_j-\hat m)'/n, \tag{3.5}$$

$$\hat a = \hat V^{-1}(\hat\mu_2-\hat\mu_1) \big/ \{1 - \hat\epsilon_1\hat\epsilon_2(\hat\mu_1-\hat\mu_2)'\hat V^{-1}(\hat\mu_1-\hat\mu_2)\}, \tag{3.6}$$

and

$$\hat b = -\tfrac{1}{2}\,\hat a'(\hat\mu_1+\hat\mu_2) + \log(\hat\epsilon_2/\hat\epsilon_1). \tag{3.7}$$

Only values of $\hat a$ and $\hat b$ are needed in solving the above equations
as $\hat m$ and $\hat V$ are given explicitly.
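One pass of the reparametrized scheme can be sketched for $p = 1$ as follows. Given current values of $a$ and $b$, the posterior probability of the second subpopulation is a logistic function of $ax + b$; the proportions and means then follow from (3.1) and (3.2), and new values of $a$ and $b$ from (3.6) and (3.7). The function name `day_step` and the data are hypothetical, and the sketch assumes $|ax + b|$ stays moderate so that the exponential does not overflow.

```python
import math


def day_step(xs, a, b):
    """One update of (a, b) for k = 2 and p = 1 using (3.4)-(3.7)."""
    n = len(xs)
    m = sum(xs) / n                                   # (3.4)
    v = sum((x - m) ** 2 for x in xs) / n             # (3.5)
    # Posterior probability of the second subpopulation (a_1 = b_1 = 0).
    p2 = [1.0 / (1.0 + math.exp(-(a * x + b))) for x in xs]
    e2 = sum(p2) / n                                  # (3.1)
    e1 = 1.0 - e2
    mu2 = sum(p * x for p, x in zip(p2, xs)) / (n * e2)           # (3.2)
    mu1 = sum((1.0 - p) * x for p, x in zip(p2, xs)) / (n * e1)   # (3.2)
    d = mu1 - mu2
    new_a = (mu2 - mu1) / v / (1.0 - e1 * e2 * d * d / v)         # (3.6)
    new_b = -0.5 * new_a * (mu1 + mu2) + math.log(e2 / e1)        # (3.7)
    return new_a, new_b


# Hypothetical data and starting values.
obs = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.0, 2.1, 2.2, 1.8]
a1, b1 = day_step(obs, 1.0, 0.0)
```

In the univariate case the denominator of (3.6) simply converts the mixture variance back to the common within-subpopulation variance, so the update agrees with dividing the difference in means by the variance estimate from (3.3).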
To obtain suitable initial values of $a$ and $b$, it is suggested
that, for various bivariate subsets of the variables, the data points
be plotted and a line drawn which divides the data into two groups each
having a scatter that appears normal (see, for example, O'Neill [28] and
Ganesalingam and McLachlan [12]). Estimates of $a$ and $b$ can be
formed on the basis of this subdivision, proceeding as if the
observations were correctly classified. There appears to be no
difficulty in locating the global maximum for $p = 1$ and 2, but for
$p \ge 3$ there are problems with multiple maxima, particularly for small
values (less than two, say) of the Mahalanobis distance between $\Pi_1$
and $\Pi_2$,

$$\Delta = \{(\mu_1-\mu_2)'\Sigma^{-1}(\mu_1-\mu_2)\}^{1/2},$$

when $n$ is not large (Day [5]). Also, it is well known (Day [5] and
Hosmer [16]) that maximum likelihood estimates based on a mixture of
normal distributions are very poor unless $n$ is very large (for
example, $n > 500$). However, Ganesalingam and McLachlan [11] found
that although the maximum likelihood estimates $\hat a$ and $\hat b$ may not
be very reliable for small $n$, it appears that the proportions in
which the components of $\hat a$ and $\hat b$ occur are such that the resulting
discriminant function, $\hat a'x + \hat b$, may still provide reasonable
separation between the subpopulations.

Note that the same set of equations here can be used as follows
to compute the estimates $\hat\mu_i$, $\hat\Sigma$, and $\hat\gamma$ under the
classification approach. At a given step $\hat\gamma_j$ is put equal to that
$g$ for which $\hat P_{gj} \ge \hat P_{ij}$ $(i=1,\ldots,k)$ where, in the
$\hat P_{ij}$, $b_r$ is used without the $\log(\epsilon_r/\epsilon_1)$ term. Then on
the next step the $\hat\mu_i$ and $\hat\Sigma$ are computed from (3.1) to (3.3)
in which, for each $j$, $\hat P_{ij}$ is replaced by 1 $(i=g)$ and 0
$(i \ne g)$. The transformed equations (3.4) to (3.7) for $k=2$ are also
applicable to the classification approach with the above modifications;
that is, the term corresponding to $\hat\epsilon_i$ in (3.6) is given by
$n_i/n$ $(i=1,2)$ while there is no term corresponding to
$\log(\hat\epsilon_2/\hat\epsilon_1)$ in (3.7).

A simulation study undertaken by Ganesalingam and McLachlan [13] 
for k=2 suggests that overall the mixture approach performs quite
favourably relative to the classification approach even where mixture 
sampling does not apply. The apparent slight superiority of the latter 
approach for samples with subpopulations represented in approximately 
equal numbers is more than offset by its inferior performance for 
disparate representations. 

4. EFFICIENCY OF THE MIXTURE APPROACH 

We consider now the efficiency of the mixture approach for k=2 
normal subpopulations, contrasting the asymptotic theory with small
sample results available from simulation. 

For a mixture of two univariate normal distributions Ganesalingam
and McLachlan [10] studied the asymptotic efficiency of the mixture
approach relative to the classical discrimination procedure (appropriate
for known $\gamma$) by considering the ratio

$$e = \{E(R) - R_0\}/\{E(R_M) - R_0\}, \tag{4.1}$$

where $E(R_M)$ and $E(R)$ denote the unconditional error rate of the
mixture and classical procedures respectively applied to an
unclassified observation subsequent to the initial sample, and $R_0$
denotes their common limiting value as $n \to \infty$. The asymptotic
relative efficiency was obtained by evaluating the numerator and
denominator of (4.1) up to and including terms of order $1/n$. The
multivariate analogue of this problem was considered independently by
O'Neill [28]. By definition the asymptotic relative efficiency does not
depend on $n$, and O'Neill [28] showed that it also does not depend
on $p$ for equal prior probabilities, $\epsilon_1 = 0.5$. The asymptotic
values of $e$ are displayed in Table 1 as percentages for selected
combinations of $\Delta$, $\epsilon_1$, $p$, and $n$; the corresponding values of
$e$ obtained from simulation are extracted from Ganesalingam and
McLachlan [11] and listed below in parentheses. It can be seen that the
asymptotic relative efficiency does not give a reliable guide as to the
true relative efficiency when $n$ is small, particularly for
$\Delta = 1$. This is not surprising since the asymptotic theory of
maximum likelihood for this problem requires $n$ to be very large before
it applies (Day [5], Hosmer [16]). Further simulation studies by
Ganesalingam and McLachlan [11] in the univariate case indicate that the
asymptotic relative efficiency gives reliable predictions at least for
$n > 100$ and $\Delta > 2$.
The simulated values for the relative efficiency in Table 1
suggest that for the mixture approach to perform comparably with the 
classical discrimination procedure it needs to be based on about two 
to five times the number of initial observations, depending on the 
combination of the parameters. 
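The ratio (4.1) is simple to evaluate once the error rates are available. The sketch below is illustrative only: the function names are hypothetical, the error rates passed in are made-up numbers rather than results from any study, and the expression used for the limiting error rate assumes two normal subpopulations with a common covariance matrix and equal proportions, for which the optimal rule has error rate $\Phi(-\Delta/2)$.

```python
import math


def optimal_error_rate(delta):
    """Phi(-delta/2): limiting error rate assuming equal proportions."""
    return 0.5 * math.erfc(delta / (2.0 * math.sqrt(2.0)))


def relative_efficiency(err_classical, err_mixture, r0):
    """Ratio (4.1): excess error of the classical procedure over that of
    the mixture approach, both measured from the limiting value r0."""
    if err_mixture <= r0:
        raise ValueError("the mixture error rate must exceed r0")
    return (err_classical - r0) / (err_mixture - r0)


# Made-up error rates purely to show the arithmetic.
r0_example = optimal_error_rate(2.0)
e_example = relative_efficiency(0.20, 0.26, 0.16)
```

A value of `e_example` below one indicates that, at the given sample size, the mixture approach has the larger excess error rate.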

5. UNEQUAL COVARIANCE MATRICES 

For normal subpopulations $\Pi_i$ with unequal covariance matrices
the classification procedure has to be applied with the restriction
that at least $p + 1$ observations belong to each subpopulation
to avoid the degenerate case of infinite likelihood.

The likelihood equations under the mixture approach are given by
(3.1) to (3.3) appropriately modified to allow for $k$ different
covariance matrices (Wolfe [34]). Unfortunately, maximum likelihood
estimation breaks down in practice, since each data point gives rise to
a singularity in the likelihood on the edge of the parameter space.

This problem has received a good deal of attention recently. For a
mixture of two univariate normal distributions, Kiefer [18] has shown
that the likelihood equations have a root $\hat\phi$ which is a consistent,
asymptotically normal and efficient estimator of
$\phi = (\theta',\epsilon')'$. Quandt and Ramsey [29] proposed the moment
generating function (MGF) estimator obtained by minimizing

$$\sum_{i=1}^{h} \Bigl\{ \psi(t_i) - \sum_{j=1}^{n} e^{t_i x_j}/n \Bigr\}^2$$

for selected values $t_1,\ldots,t_h$ of $t$ in some small interval $(c,d)$,
$c < 0 < d$, where

$$\psi(t) = \sum_{i=1}^{2} \epsilon_i \exp(\mu_i t + \tfrac{1}{2}\sigma_i^2 t^2)$$

is the MGF of a mixture of two normal distributions with variances
$\sigma_1^2$ and $\sigma_2^2$. The usefulness of the MGF method would appear to be
that it provides a consistent estimate which can be used as a starting
value when applying the EM algorithm in an attempt to locate the root
of the likelihood equations corresponding to the consistent,
asymptotically efficient estimator. Bryant [3] suggests taking the
classification maximum likelihood estimate of $\phi$ as a starting value
in the likelihood equations.
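Two points from this section can be made concrete with a short sketch for the univariate two-component case. The first function evaluates the mixture log likelihood with unequal variances, so that the singularity can be seen by placing one mean on a data point and shrinking the corresponding standard deviation. The other two evaluate the mixture MGF and the sum of squares minimized by the MGF estimator. All function names, data, and values of $t$ are hypothetical choices for illustration.

```python
import math


def mixture_loglik(xs, mu1, sd1, mu2, sd2, e1):
    """Log likelihood of a two-component normal mixture, unequal variances."""
    root = math.sqrt(2.0 * math.pi)
    total = 0.0
    for x in xs:
        d1 = math.exp(-0.5 * ((x - mu1) / sd1) ** 2) / (sd1 * root)
        d2 = math.exp(-0.5 * ((x - mu2) / sd2) ** 2) / (sd2 * root)
        total += math.log(e1 * d1 + (1.0 - e1) * d2)
    return total


def mixture_mgf(t, means, variances, props):
    """MGF of a normal mixture: sum_i eps_i exp(mu_i t + sigma_i^2 t^2 / 2)."""
    return sum(e * math.exp(m * t + 0.5 * v * t * t)
               for e, m, v in zip(props, means, variances))


def mgf_objective(xs, ts, means, variances, props):
    """Sum over the chosen t of (model MGF - empirical MGF)^2."""
    n = len(xs)
    total = 0.0
    for t in ts:
        empirical = sum(math.exp(t * x) for x in xs) / n
        total += (mixture_mgf(t, means, variances, props) - empirical) ** 2
    return total


# Singularity: centre the first component on the first observation and
# let its standard deviation shrink; the log likelihood keeps growing.
values = [0.0, 1.0, 2.0, 3.0]
logliks = [mixture_loglik(values, values[0], sd, 1.5, 1.0, 0.5)
           for sd in (1e-2, 1e-4, 1e-8)]
```

The objective in `mgf_objective` is bounded below by zero and has no such singularity, which is one way of seeing why its minimizer can serve as a starting value for the likelihood equations.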

The robustness of the mixture approach based on normality as a 
clustering procedure requires investigation. A recent case study by 
Hernandez-Alvi [15] suggests that, at least in the case where the
variables are in the form of proportions, the mixture approach may be 
reasonably robust from a clustering point of view of separating samples 
in the presence of multimodality. 

6. UNKNOWN NUMBER OF SUBPOPULATIONS 

Frequently with the application of clustering techniques there is 
the difficult problem of deciding how many subpopulations, k, there 
are. A review of this problem has been given by Everitt [8]; see also 
Engelman and Hartigan [7] and Lee [19]. With respect to the
classification approach Marriott [21] has suggested taking $k$ to be the
number which minimizes $k^2|W|$. For heterogeneous covariance matrices
there may be some excessive subdivision, but this can be rectified by
recombining any two clusters which by themselves do not suggest
separation was necessary.

With the mixture approach the likelihood ratio test is an obvious 
criterion for choosing the number of subpopulations. However, for 
testing the hypothesis of, say, $k_1$ versus $k_2$ subpopulations
($k_1 < k_2$), it has been noted (Wolfe [35]) that some of the regularity
conditions are not satisfied for minus twice the log-likelihood ratio
to have under the null hypothesis an approximate chi-square distribution 
with degrees of freedom equal to the difference in the number of parameters 
in the two hypotheses. Wolfe [35] suggested using a chi-square distribution 
with twice the difference in the number of parameters (not including the 
proportions), which appears to be a reasonable approximation (Hernandez-Alvi 
[15]). 
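The approximation described above can be sketched as follows. Because the suggested degrees of freedom are twice an integer, they are always even, and the upper tail of the chi-square distribution then has a closed form that needs only the standard library. The function names and the log-likelihood values are hypothetical; the statistic used is minus twice the log-likelihood ratio exactly as described in the text, with no further adjustment.

```python
import math


def chi2_sf_even(x, df):
    """Upper tail probability of a chi-square variable with even df:
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def wolfe_p_value(loglik_k1, loglik_k2, extra_params):
    """Approximate p-value for k1 versus k2 subpopulations (k1 < k2).
    extra_params is the difference in the number of parameters between
    the two hypotheses, not counting the proportions."""
    stat = -2.0 * (loglik_k1 - loglik_k2)
    return chi2_sf_even(stat, 2 * extra_params)


# Made-up maximized log likelihoods purely to show the arithmetic.
p_example = wolfe_p_value(-110.0, -100.0, 2)
```

A small value of `p_example` would be read as evidence in favour of the larger number of subpopulations, subject to the caveats about the regularity conditions noted in the text.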


7. PARTIAL CLASSIFICATION OF SAMPLE 


We now consider the situation where the classification of some of 
the observations in the sample is initially known. This information can 
be easily incorporated into the maximum likelihood procedures for the 
classification and mixture approaches. If an $x_j$ is known to come from,
say, $\Pi_r$, then under the former approach $\hat\gamma_j = r$ always in the
associated iterative process while, under the latter, $P_{ij}$ is set
equal to 1 $(i = r)$ and 0 $(i \ne r)$ in all the iterations. In those
situations where there are sufficient data of known classification to
form a reliable discrimination rule, the unclassified data can be
clustered simply according to this rule and, for the classification
approach, the results of McLachlan [24,25] suggest this may be
preferable unless the unclassified data are in approximately the same
proportion from each subpopulation. With the mixture approach a more
efficient clustering of the unclassified observations should be
obtained by simultaneously using them in the estimation of the
subpopulation parameters, at least as $n \to \infty$, since the procedure
is asymptotically efficient. The question of whether it is a worthwhile
exercise to update a discrimination rule on the basis of a limited
number of unclassified observations has been considered recently by
McLachlan and Ganesalingam [26]. For other work on the updating problem
the reader is referred to Titterington [33], Murray and Titterington
[27], and Anderson [1].
TABLE 1

Asymptotic Versus Simulation Results for the
Relative Efficiency of the Mixture Approach

                    p=1, n=20            p=2, n=20            p=3, n=40
$\epsilon_1$ =     0.25     0.50        0.25     0.50        0.25     0.50

$\Delta$ = 1       0.25     0.51        0.34     0.51        0.42     0.51
                 (33.01)  (25.12)     (46.71)  (63.11)     (25.00)  (43.39)

$\Delta$ = 2       7.29    10.08        9.36    10.08       10.51    10.08
                 (22.05)  (17.74)     (25.73)  (16.26)     (16.28)  (14.51)

$\Delta$ = 3      31.41    35.92       35.13    35.92       36.78    35.92
                 (19.57)  (23.54)     (43.91)  (29.63)     (29.01)  (23.46)

A review is undertaken of two maximum likelihood approaches to
cluster analysis, the so-called classification and mixture maximum
likelihood methods. The basic assumptions of the two approaches and
their associated properties are contrasted, in particular for
multivariate normal component distributions. The problem of deciding how
many clusters there are is discussed for each approach. Also, an
account is given of the relative efficiency of the mixture approach
to clustering.